A novel filtration method in biological sequence databases
Abstract
In this paper, we propose a new filtration method, the Transformation-based Database Filtration method (TDF), which screens out the data sequences in a DNA sequence database that cannot satisfy a given query sequence. The method consists of two phases. First, we divide each data sequence into several windows (blocks), each of which is transformed into a data feature vector using the Haar wavelet transform. The data feature vectors are then stored in an index file. Second, we divide the query sequence into sliding windows, each of which is likewise transformed into a query feature vector using the Haar wavelet transform. We then search the index file to find the candidate sequences for each query feature vector and check whether they match the query sequence using a sequence alignment algorithm. We transform the bound on the edit distance between sequences into a bound on the Manhattan distance between feature vectors. Because the Manhattan distance is much cheaper to compute, the method can efficiently screen out impossible data sequences while guaranteeing no false negatives. The experimental results show that the method outperforms the QUASAR method in terms of filtration ratio, precision, execution time and index size. It also outperforms the YM method for long queries and for low-complexity and repetitive data. © 2006 Elsevier B.V. All rights reserved.
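The two phases described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. The numeric base encoding (`BASE_CODE`), the unnormalized averaging/differencing form of the Haar transform, the in-memory list standing in for the index file, and the `threshold` parameter are all assumptions made for the example. In particular, the abstract does not state how the edit-distance bound maps to a Manhattan-distance bound, so the threshold used here carries no no-false-negative guarantee.

```python
from typing import List, Tuple

# Hypothetical numeric encoding of bases; the abstract does not specify one.
BASE_CODE = {"A": 0.0, "C": 1.0, "G": 2.0, "T": 3.0}


def haar_transform(values: List[float]) -> List[float]:
    """Full Haar decomposition by repeated pairwise averaging and differencing.

    The input length must be a power of two.
    """
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("window length must be a power of two")
    out = list(values)
    while n > 1:
        half = n // 2
        avgs = [(out[2 * i] + out[2 * i + 1]) / 2 for i in range(half)]
        diffs = [(out[2 * i] - out[2 * i + 1]) / 2 for i in range(half)]
        out[:n] = avgs + diffs  # averages first, then detail coefficients
        n = half
    return out


def feature_vector(window: str) -> List[float]:
    """Encode a window of bases numerically and apply the Haar transform."""
    return haar_transform([BASE_CODE[b] for b in window])


def manhattan(u: List[float], v: List[float]) -> float:
    """Manhattan (L1) distance between two feature vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))


def build_index(sequence: str, w: int) -> List[Tuple[int, List[float]]]:
    """Phase 1: cut a data sequence into non-overlapping blocks of length w."""
    return [(s, feature_vector(sequence[s:s + w]))
            for s in range(0, len(sequence) - w + 1, w)]


def candidate_blocks(index: List[Tuple[int, List[float]]], query: str,
                     w: int, threshold: float) -> List[int]:
    """Phase 2: slide a window over the query and keep nearby data blocks.

    `threshold` is an assumed parameter, not the bound derived in the paper.
    """
    hits = set()
    for q in range(len(query) - w + 1):
        qv = feature_vector(query[q:q + w])
        for start, dv in index:
            if manhattan(qv, dv) <= threshold:
                hits.add(start)
    return sorted(hits)


if __name__ == "__main__":
    index = build_index("ACGTTTTT", 4)
    print(candidate_blocks(index, "ACGT", 4, 0.0))
```

In a full pipeline, the surviving blocks would then be verified with a sequence alignment algorithm, as the abstract describes. That verification step is omitted here.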
Similar resources
An Efficient Filtration Method in Biological Sequence Databases
Sequence comparison is one of the most important primitive operations in bioinformatics. Roughly speaking, this operation finds which parts of sequences are alike and which parts are different. As the size of a sequence database scales to millions of base pairs, it becomes impractical to search the whole database with sequence alignment methods based on the dynamic programming approach which yi...
Protein Databases
Proteins are sources of many peptides with diverse biological activity. Some of them are considered as valuable components of foods and drug targets with desired and designed biological activity. We are now entering an era rich in biological data in which the field of bioinformatics is poised to exploit this information in increasingly powerful ways. There are currently many databases all over ...
BFT: A Relational-based Bit Filtration Technique for Efficient Approximate String Joins in Biological Databases
Joining massive tables in relational databases has received substantial attention in the past decade. Numerous filtration and indexing techniques have been proposed to reduce the curse of dimensionality. This paper proposes a novel approach to map the problem of pairwise whole genome comparison into an approximate join operation in the well-established relational database context. We propose a ...
Crossflow Filtration of Sodium Chloride Solution by A Polymeric Nanofilter: Minimization of Concentration Polarization by a Novel Backpulsing Method
In the present study, the production of low-salt water from salty water by nanofiltration, as well as membrane fouling, was investigated. Furthermore, a new method for creating the backpulse was proposed and tested experimentally in order to minimize fouling and increase filtration efficiency. In the proposed method, the permeate was used instead of gas for creating the...
Efficient Filtration of Sequence Homology Search through Singular Value Decomposition
Similarity search in textual databases and bioinformatics has received substantial attention in the past decade. Numerous filtration and indexing techniques have been proposed to reduce the curse of dimensionality. This paper proposes a novel approach to map the problem of whole-genome sequence homology search into an approximate vector comparison in the well-established multidimensional vector...
Journal title:
- Pattern Recognition Letters
Volume 28
Publication date: 2007